6 research outputs found

    A free/open-source hybrid morphological disambiguation tool for Kazakh

    Get PDF
    This paper presents the results of developing a morphological disambiguation tool for Kazakh. Starting with a previously developed rule-based approach, we tried to cope with the complex morphology of Kazakh by breaking up lexical forms across their derivational boundaries into inflectional groups and modeling their behavior with statistical methods. A hybrid rule-based/statistical approach appears to benefit morphological disambiguation demonstrating a per-token accuracy of 91% in running text

    A free/open-source hybrid morphological disambiguation tool for Kazakh

    Get PDF
    This paper presents the results of developing a morphological disambiguation tool for Kazakh. Starting with a previously developed rule-based approach, we tried to cope with the complex morphology of Kazakh by breaking up lexical forms across their derivational boundaries into inflectional groups and modeling their behavior with statistical methods. A hybrid rule-based/statistical approach appears to benefit morphological disambiguation demonstrating a per-token accuracy of 91% in running text

    Технології вирівнювання та розширення паралельних корпусів казахської мови

    No full text
    The paper presents the two-stage alignment and extending methods of parallel corpora for the Kazakh language. The Kazakh language is agglutinative with rich morphology and related to the Turkic language group. So, the traditional alignment methods for similar languages do not work for the Kazakh language. The alignment is used primarily to ensure that the fragment corresponding to the original is found in the translation. After that, identical fragments of parallel texts are compared with each other. At the initial stage, the question is what needs to be leveled. It is possible to align word by word, but this often becomes almost impossible for several reasons: sets of lexemes and expressions do not match in different languages. Considering the linguistic peculiarities of languages, the developed technologies and ways of universal alignment of parallel text may not work in languages with agglutination. It means that the form of the word is formed by additional affixes and auxiliary words that carry semantic and morphological information. The approach presented in this paper is to use a two-stage alignment, which uses a bilingual dictionary of synonyms. The evaluation with the use of the English-Kazakh corpus verifies that our method shows an average of 89 % correct alignment. The second method is designed to expand the parallel corpus due to the lack of natural parallel corpora of the Kazakh-English language pair with good quality. The developed method uses a combinatorial method taking into account the semantic and grammatical features of the Kazakh language. Different tenses of the Kazakh language are used for sentence generation, and different endings for parts of speech are also considered.У роботі представлені методи двоетапного вирівнювання та розширення паралельних корпусів казахської мови. Казахська мова є аглютинативною, має багату морфологію  та відноситься до тюркської мовної групи. Тому традиційні методи вирівнювання з подібними мовами не підходять для казахської мови. Вирівнювання використовується в першу чергу для знаходження у перекладі фрагмента, що відповідає оригіналу. Після цього ідентичні фрагменти паралельних текстів порівнюють один з одним. На початковому етапі питання полягає у тому, що підлягає вирівнюванню. Можна виконати послівне вирівнювання, але часто це стає практично неможливим з кількох причин: набори лексем та виразів у різних мовах не співпадають. Враховуючи лінгвістичні особливості мов, розроблені технології та способи універсального вирівнювання паралельного тексту можуть не підійти для мов з аглютинацією. Це означає, що форма слова утворюється додатковими афіксами та допоміжними словами, що несуть семантичну і морфологічну інформацію. Підхід, представлений в даній роботі, полягає у застосуванні двоетапного вирівнювання з використанням двомовного словника синонімів. Оцінка з використанням англо-казахського корпусу підтверджує правильність вирівнювання нашим методом в середньому на 89 %. Другий метод призначений для розширення паралельного корпусу у зв'язку із відсутністю хорошої якості природних паралельних корпусів казахсько-англійської мовної пари. У розробленому методі використовується комбінаторна техніка з урахуванням семантичних та граматичних особливостей казахської мови. Для побудови речень використовують різні часи казахської мови, а також враховуються різні закінчення частин мови

    Neural machine translation system for the Kazakh language based on synthetic corpora

    No full text
    The lack of big parallel data is present for the Kazakh language. This problem seriously impairs the quality of machine translation from and into Kazakh. This article considers the neural machine translation of the Kazakh language on the basis of synthetic corpora. The Kazakh language belongs to the Turkic languages, which are characterised by rich morphology. Neural machine translation of natural languages requires large training data. The article will show the model for the creation of synthetic corpora, namely the generation of sentences based on complete suffixes for the Kazakh language. The novelty of this approach of the synthetic corpora generation for the Kazakh language is the generation of sentences on the basis of the complete system of suffixes of the Kazakh language. By using generated synthetic corpora we are improving the translation quality in neural machine translation of Kazakh-English and Kazakh-Russian pairs

    Semantic Connections in the Complex Sentences for Post-Editing Machine Translation in the Kazakh Language

    No full text
    The problems of machine translation are constantly arising. While the most advanced translation platforms, such as Google and Yandex, allow for high-quality translations of languages with simple grammatical structures, more morphologically rich languages still suffer from the translation of complex sentences, and translation services leave many structural errors. This study focused on designing the rules for the grammatical structures of complex sentences in the Kazakh language, which has a difficult grammar with many rules. First, the types of complex sentences in the Kazakh language were thoroughly observed with the use of templates from the FuzzyWuzzy library. Then, the correction of complex sentences was completed with parallel corpora. The sentences were translated into English and Russian by existing machine translation systems. Therefore, the grammar of both Kazakh–English and Kazakh–Russian language pairs was considered. They both used the rules specifically designed for the post-editing steps. Finally, the performance of the developed algorithm was evaluated for an accuracy score for each pair of languages. This approach was then proposed for use in other corpora generation, post-editing, and analysis systems in future works

    Semantic Connections in the Complex Sentences for Post-Editing Machine Translation in the Kazakh Language

    No full text
    The problems of machine translation are constantly arising. While the most advanced translation platforms, such as Google and Yandex, allow for high-quality translations of languages with simple grammatical structures, more morphologically rich languages still suffer from the translation of complex sentences, and translation services leave many structural errors. This study focused on designing the rules for the grammatical structures of complex sentences in the Kazakh language, which has a difficult grammar with many rules. First, the types of complex sentences in the Kazakh language were thoroughly observed with the use of templates from the FuzzyWuzzy library. Then, the correction of complex sentences was completed with parallel corpora. The sentences were translated into English and Russian by existing machine translation systems. Therefore, the grammar of both Kazakh–English and Kazakh–Russian language pairs was considered. They both used the rules specifically designed for the post-editing steps. Finally, the performance of the developed algorithm was evaluated for an accuracy score for each pair of languages. This approach was then proposed for use in other corpora generation, post-editing, and analysis systems in future works
    corecore